LDA

Linear Discriminant Analysis - LDA



Summary:

  • LDA assumes that the data points within each category are normally distributed.

  • LDA finds the direction that, once the data are projected onto it, best separates the groups: the projected class means are pushed far apart while the spread within each class stays small.

  • This separator can be used as a classifier or as a dimension-reduction technique.

  • When more variables are present, this separator can still be found regardless of the number of dimensions. (see below)
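To make the projection idea concrete, here is a minimal sketch using R's built-in iris data and MASS::lda() (iris stands in for the reduced_bc data used in the example below, so the dataset and column choices here are assumptions of this sketch):

```r
# Minimal sketch: LDA as a projection (built-in iris data as a stand-in)
library(MASS)

fit <- lda(Species ~ ., data = iris)

# project the observations onto the discriminant directions
scores <- as.matrix(iris[, 1:4]) %*% fit$scaling

# with 3 classes and 4 predictors there are min(3 - 1, 4) = 2 directions
ncol(fit$scaling)  # 2
```

Along LD1 alone the three species are already largely separated, which is exactly the kind of separator the bullets above describe.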



Example Code and Plot:

# load the required packages
library(MASS)     # for lda()
library(ggplot2)

# fit the LDA model and retrieve the estimate for B
lda.obj = lda(type ~ ., data = reduced_bc)
Bhat = lda.obj$scaling

# Find the scores - the original observations in terms of our LD directions
lda.scores = as.matrix(reduced_bc[,-1])%*%lda.obj$scaling

# Only use the first two LD directions for our 2D plot
lda.scores.forplot = as.matrix(reduced_bc[,-1])%*%Bhat[,1:2]

# Hence, we will use this data to make our plot
data.forplot = data.frame(lda.scores.forplot, Type = reduced_bc$type)

# plot the scores and color them appropriately!
ggplot(data.forplot) + 
  geom_point(aes(x = LD1, y = LD2, color = Type), size = 2) + 
  ggtitle("Method: LDA") + 
  theme_bw()



When to use:

  • LDA works best when the predictors are approximately normally distributed, the classes are linearly separable, and the response is categorical.

  • Goal of LDA: to clearly separate the different categories’ data points in the lower-dimensional space.

  • However, it is not as effective as t-SNE and UMAP when the dimension, i.e. the number of variables, is very high.

t-SNE

t-Distributed Stochastic Neighbor Embedding - t-SNE



Summary:

  • t-SNE starts by measuring the similarity between each pair of data points, both in the high-dimensional space and in the low-dimensional space.

  • In the high-dimensional space the similarities are based on a normal distribution, but in the low-dimensional space a t-distribution is used.

  • Lastly, t-SNE minimizes the Kullback-Leibler divergence between the two sets of similarities, which makes the low-dimensional similarity matrix match the high-dimensional one as closely as possible.

  • Linked below is an interactive website that provides a clear visual comparison of t-SNE and UMAP:

https://pair-code.github.io/understanding-umap/
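The Kullback-Leibler divergence mentioned above can be written down in a few lines. This toy sketch (made-up probabilities, not actual t-SNE similarities) shows that a closer match between the two distributions gives a smaller divergence:

```r
# Toy sketch of Kullback-Leibler divergence, the quantity t-SNE minimizes
kl_div <- function(p, q) sum(p * log(p / q))

p  <- c(0.7, 0.2, 0.1)    # "high-dimensional" similarities (toy values)
q1 <- c(0.65, 0.25, 0.10) # close to p
q2 <- c(0.10, 0.20, 0.70) # far from p

kl_div(p, q1) < kl_div(p, q2)  # TRUE: a closer match gives a smaller divergence
```

Driving this divergence toward zero is what pushes the low-dimensional layout to mimic the high-dimensional neighborhood structure.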



Example Code and Plot:

# load the required packages
library(Rtsne)
library(scales)  # for alpha()

# run the t-SNE
tsne <- Rtsne(as.matrix(reduced_bc[, -1]),
              dims=2, perplexity=15, verbose=FALSE, max_iter=5000)

# color and name the groups
colors <- rainbow(length(unique(reduced_bc$type)))
names(colors) <- unique(reduced_bc$type)

# make the plot
plot(tsne$Y, t='n',
     main="Method: tSNE", xlab="tSNE Dimension 1", ylab="tSNE Dimension 2",
     xlim=c(-15, 15), ylim=c(-20, 20), cex.main=1.2, cex.lab=1)
points(tsne$Y[,1], tsne$Y[,2], col=alpha(colors[reduced_bc$type], 0.8), cex=0.9, pch=19)
legend("topright",legend=unique(reduced_bc$type), col=colors,
       pch=19, bty="n", pt.cex=1.2, cex=0.6, text.col=colors, horiz=F,
       inset=0.01, y.intersp=1.2)



When to use:

  • t-SNE is an unsupervised, nonlinear dimensionality reduction technique.

  • It is well suited to embedding high-dimensional data into a lower-dimensional space (2D or 3D) for data visualization.

  • This technique is typically used for exploratory analysis.
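Since perplexity is t-SNE's main tuning parameter, here is a quick sketch of varying it, again on the built-in iris data as a stand-in for reduced_bc (note that Rtsne() requires unique rows and roughly 3 * perplexity < n - 1):

```r
# Sketch: the effect of the perplexity tuning parameter in Rtsne
library(Rtsne)

X <- as.matrix(unique(iris[, 1:4]))  # Rtsne() does not allow duplicate rows

set.seed(1)
tsne_low  <- Rtsne(X, dims = 2, perplexity = 5)   # emphasizes local structure
tsne_high <- Rtsne(X, dims = 2, perplexity = 40)  # emphasizes global structure

# each fit returns one 2D coordinate per observation in $Y
```

Plotting both embeddings side by side (as in the plot code above) is a good way to see how strongly the layout depends on this parameter.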

UMAP

Uniform Manifold Approximation and Projection - UMAP



Summary:

  • This concept is very similar to t-SNE; however, instead of measuring the similarity between all pairs of points, UMAP builds a graph in which only the similarities between neighboring points are needed.

  • It then creates a lower-dimensional graph, which is optimized to look as similar as possible to the high-dimensional one.

  • Linked below is an interactive UMAP website that gives more in-depth explanations of the tuning parameters:

https://pair-code.github.io/understanding-umap/
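The main tuning parameters that site discusses, n_neighbors and min_dist, can be set through umap.defaults in the umap package. A small sketch (built-in iris data as a stand-in for reduced_bc):

```r
# Sketch: adjusting UMAP's tuning parameters via umap.defaults
library(umap)

config <- umap.defaults
config$n_neighbors <- 10  # smaller values keep the graph more local (default 15)
config$min_dist <- 0.05   # how tightly points may pack in the embedding

fit <- umap(as.matrix(iris[, 1:4]), config = config)
# fit$layout holds one 2D coordinate per observation
```

The same config object can be passed to the umap() call in the example below in place of the defaults.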



Example Code and Plot:

# split the label and data separately
reduce_bc_label = reduced_bc$type
reduce_bc_data = reduced_bc[,-1]



# load the required package and run the umap function
library(umap)
reduce_bc_umap = umap(reduce_bc_data)



# UMAP plot written as a function
plot.reduce.bc = function(x, labels,
                          main="Method:UMAP",
                          colors=c("#ff7f00", "#e377c2", "#17becf"),
                          pad=0.1, cex=0.6, pch=19, add=FALSE, legend.suffix="",
                          cex.main=1, cex.legend=0.85){
  layout = x
  if (is(x, "umap")) {
    layout = x$layout
  }
  xylim = range(layout)
  xylim = xylim + ((xylim[2]-xylim[1])*pad)*c(-0.5, 0.5)
  if (!add) {
    par(mar=c(0.2,0.7,1.2,0.7), ps=10)
    plot(xylim, xylim, type="n", axes=FALSE, frame=FALSE)
    rect(xylim[1], xylim[1], xylim[2], xylim[2], border="#aaaaaa", lwd=0.25)
  }
  points(layout[,1], layout[,2], col=as.integer(as.factor(labels)),
         cex=cex, pch=pch)
  mtext(side=3, main, cex=cex.main)
  labels.u = unique(as.factor(labels))
  legend.pos = "topright"
  legend.text = as.character(labels.u)
  if (add) {
    legend.pos = "topright"
    legend.text = paste(as.character(labels.u), legend.suffix)
  }
  legend(legend.pos, legend=legend.text, inset=0.03,
         col=as.integer(labels.u),
         bty="n", pch=pch, cex=cex.legend)
}

# run the umap plot
plot.reduce.bc(reduce_bc_umap, reduce_bc_label)



When to use:

  • UMAP is an optimized extension of t-SNE, so any situation where t-SNE can be used is also a candidate for UMAP.

  • This method is less computationally expensive because it only considers the “neighbors” of each point when calculating the similarity measure and creates the graph accordingly.